Automatic Identification of Research Articles from Crawled Documents
Abstract
Online digital libraries that store and index research articles not only make it easier for researchers to search for scientific information, but have also proven to be powerful resources for many data mining, machine learning, and information retrieval applications that require high-quality data. The quality of the data available in a digital library depends heavily on the quality of the classifier that identifies research articles among crawled documents, which in turn depends, among other things, on the choice of feature representation. The commonly used “bag of words” representation for document classification can result in prohibitively high-dimensional input spaces and may not capture the specifics of research articles. In this paper, we propose novel features that result in effective and efficient classification models for the automatic identification of research articles. Experimental results on two datasets compiled from the CiteSeer digital library show that our models outperform strong baselines while using a significantly smaller number of features.
Similar resources
Feedback Prediction for Blogs
The last decade led to an unbelievable growth in the importance of social media. Due to the huge amounts of documents appearing in social media, there is an enormous need for the automatic analysis of such documents. In this work, we focus on the analysis of documents appearing in blogs. We present a proof-of-concept industrial application, developed in cooperation with Capgemini Magyaroszág K...
A Corpus-based Study of Lexical Bundles in Discussion Section of Medical Research Articles
There has been increasing interest in utilizing corpora in linguistic research and pedagogy in recent years. Rhetorical organization of different sections of research articles may appear similar in various disciplines, but close examination may show subtle differences nonetheless. One of the features that has been at the center of attention especially in recent years is the idiomaticity of a di...
Fully Automatic Compilation of Portuguese-English and Portuguese-Spanish Parallel Corpora
This paper reports the fully automatic compilation of parallel corpora for Brazilian Portuguese. Scientific news texts available in Brazilian Portuguese, English and Spanish are automatically crawled from a multilingual Brazilian magazine. The texts are then automatically aligned at document- and sentence-level. The resulting corpora contain about 2,700 parallel documents totaling over 150,000 al...
Species Disambiguation for Biomedical Term Identification
An important task in information extraction (IE) from biomedical articles is term identification (TI), which concerns linking entity mentions (e.g., terms denoting proteins) in text to unambiguous identifiers in standard databases (e.g., RefSeq). Previous work on TI has focused on species-specific documents. However, biomedical documents, especially full-length articles, often talk about entiti...
Contextual information retrieval in research articles: Semantic publishing tools for the research community
In recent years, the dramatic increase in academic research publications has gained significant research attention. Research has been carried out exploring novel ways of providing information services using this research content. However, the task of extracting meaningful information from research documents remains a challenge. This paper presents our research work on developing intelligent inf...
Publication date: 2014